require(adespatial)
require(rubias)
require(hierfstat)
require(scico)
require(ade4)
require(marmap)
require(vegan)
require(ggmap)
require(pegas)
require(adegenet)
require(tidyverse)
require(knitr)
require(magrittr)
This notebook is part of an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the relevant html file in a browser.
The full project with data (except raw sequencing data) and results is archived as a github repository at [https://github.com/david-dayan/chum_coastal_pilot]
Here we conduct the analysis for the Oregon Coastal Chum Salmon Genetics Pilot Study (2019-2021) Oregon Department of Fish and Wildlife Conservation and Recovery Program using the Oke350 GTseq panel.
This notebook contains analysis logs and results from the four objectives of this project:
(1) Collect and analyze samples from the three largest coastal chum salmon populations (Nehalem, Tillamook, and Yaquina), and two additional basins (Siletz and Netarts) that often have a substantial number of spawners, to investigate genetic structure of coastal chum salmon populations.
(2) Collect and analyze samples from two major Tillamook sub-basins (Kilchis and Miami) to investigate whether there is significant genetic structure within the basin.
(3) Collect tissue samples from different anatomical locations of chum carcasses to investigate whether certain tissues are more likely to provide higher quality samples for analysis.
(4) Analyze a small number of archival chum scale samples to evaluate the potential for investigating historical population structure, using the large number of chum scale samples ODFW has collected through spawning grounds surveys over time.
We also answer a lingering question from a previous sequencing run using these new data: why did so many samples fail in the panel optimization run?
Sequencing Data
Run 1:
The first run was the test plate for the Oke350 panel. Raw compressed reads are at /dfs/Omalley_Lab/dayan/coastal_chinook_2020/full_panel/demux/keta_reads .
95 samples library from four populations spiked into a larger coastal chinook run. Demuxed with deML. The total number of reads was 50,680,864 .
Run 2:
Raw sequencing data is at /dfs/Omalley_Lab/dayan/chum_pilot_2021 and /dfs/Omalley_Lab/fitz/Runs/4774 . It was demultiplexed by the sequencing center
Raw data is from the SFGL Illumina 020 run.
285 samples total (including controls, replicates and samples from run 1 that required resequencing)
Intermediate data/results are in the directory /dfs/Omalley_Lab/dayan/chum_pilot_2021
Samples
Before filtering (duplicates removed) there were 209 samples, sample size per basin and spawning tributary (if available) is below.
meta_data <- readxl::read_xlsx("metadata/Oke GT-seq runs.xlsx", sheet = 6)
#note here that some samples in the genotype data were missing from the metadata ("OkeCC19MILC_0001" "OkeCC19NEHR_0001" "OkeCC19TILR_0102"), the data was available from other sources, manually added them to the "combined" on the metadata xslx.
kable(meta_data %>% count(basin, detailed_location))
| basin | detailed_location | n |
|---|---|---|
| Coos | Coos River | 3 |
| Nehalem | Nehalem River | 50 |
| Netarts | Whiskey Creek | 26 |
| Siletz | Bear Creek | 3 |
| Tillamook | Kilchis River | 50 |
| Tillamook | Miami River | 50 |
| Yaquina | Mill Creek | 84 |
| Yaquina | Simpson Creek | 10 |
map_data <- meta_data %>%
select(basin, detailed_location, lon, lat)
bound_box <- make_bbox(lon = c(-123,-125), lat = map_data$lat, f = .2)
map <- get_map(location = bound_box, maptype = "terrain", source = "google", zoom = 8)
ggmap(map) + geom_point(data = map_data, aes(x = lon, y = lat), color = "red")+ geom_text(data = distinct(map_data, basin, .keep_all = TRUE), aes(label = basin ), hjust = 1.2)
ggmap(map) + geom_point(data = map_data, aes(x = lon, y = lat), color = "red")+ geom_text(data = distinct(arrange(map_data, desc(lat)), basin, .keep_all = TRUE), aes(label = basin ), hjust = 1.2)+xlab("")+ylab("")
Note here that the number of Mill Creek samples is 10 greater than reflected in the sample summary from raw genotype data because 10 archival scale samples were not included in the library.
Genotype Data
Detailed log of genotyping (raw reads to filtered genotypes) is available in a R notebook
The final filtered dataset includes 235 individuals and 325 markers. Basin and subbasin sample sizes after filtering are below
#note here we also unify the metadata and population names
load("genotype_data/genind_2.0.R")
load("genotype_data/genotypes_2.2.R")
genos_2.0 %<>%
left_join(meta_data)
genind_2.0@pop <- as.factor(genos_2.0$basin)
kable(genos_2.0 %>%
group_by(basin, detailed_location) %>%
summarise(n = n(), missing_data = mean((100-`%GT`)/100))
)
| basin | detailed_location | n | missing_data |
|---|---|---|---|
| Coos | Coos River | 3 | 0.0552333 |
| Nehalem | Nehalem River | 49 | 0.0276490 |
| Netarts | Whiskey Creek | 13 | 0.1281462 |
| Siletz | Bear Creek | 3 | 0.0495333 |
| Tillamook | Kilchis River | 45 | 0.0600022 |
| Tillamook | Miami River | 46 | 0.0696913 |
| Yaquina | Mill Creek | 67 | 0.0521119 |
| Yaquina | Simpson Creek | 9 | 0.0066778 |
kable(genos_2.0 %>%
group_by(basin) %>%
summarise(n = n(), missing_data = mean((100-`%GT`)/100))
)
| basin | n | missing_data |
|---|---|---|
| Coos | 3 | 0.0552333 |
| Nehalem | 49 | 0.0276490 |
| Netarts | 13 | 0.1281462 |
| Siletz | 3 | 0.0495333 |
| Tillamook | 91 | 0.0649000 |
| Yaquina | 76 | 0.0467316 |
First let’s took a look at the data in PC space.
X <- scaleGen(genind_2.0, NA.method="mean")
#then run pca
pca1 <- dudi.pca(X, scale = FALSE, scannf = FALSE, nf = 325)
#check pcs to keep with kaiser-guttman
#kaiser guttman
cutoff<-mean(pca1$eig)
kg <- length((pca1$eig)[(pca1$eig)>cutoff])
barplot(pca1$eig, main = "PCA eigenvalues")
abline(h = cutoff, col = "red")
#broken stick
n <- length(pca1$eig)
bsm <- data.frame(j=seq(1:n), p = 0)
bsm$p[1] <- 1/n
for (i in 2:n){
bsm$p[i] <- bsm$p[i-1]+(1/(n+1-i))
}
bsm$p <- 100*bsm$p/n
pca_eigs_to_plot <- as.data.frame(cbind(100*pca1$eig/sum(pca1$eig)), rev(bsm$p))
pca_eigs_to_plot %<>%
rownames_to_column(var = "bsm") %>%
rename(pca_eig_perc = V1) %>%
mutate(pca_eig_perc = as.numeric(pca_eig_perc))
#kept all PCs
snp_pcs <- pca1$li#[,c(1:kg)]
#now plot data
snp_pcs %<>%
rownames_to_column("sample") %>%
left_join(meta_data)
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = basin)) + stat_ellipse(aes(Axis1, Axis2, color = basin)) +theme_classic()+scale_color_scico_d()
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis3, color = basin)) + stat_ellipse(aes(Axis1, Axis2, color = basin)) +theme_classic()+scale_color_scico_d()
#s.class(pca1$li, as.factor(snp_pcs$basin),xax=1,yax=2, col=transp(viridisLite::viridis(6),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
#add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2, posi = "bottomright" )
#3d plot as well
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$basin, alpha = 0.8)
#and plot by approx distance
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = lat)) +theme_classic()+scale_color_viridis_c()
ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis3, color = lat)) +theme_classic()+scale_color_viridis_c()
plotly::plot_ly(x=snp_pcs$Axis1, y=snp_pcs$Axis2, z=snp_pcs$Axis3, type="scatter3d", mode="markers", color=snp_pcs$lat, alpha = 0.8, colors = )
# publication plot (with bsm_+kaiser guttman criteria)
a <- ggplot(data = snp_pcs)+geom_point(aes(Axis1, Axis2, color = basin), alpha = 0.5) + stat_ellipse(aes(Axis1, Axis2, color = basin)) +theme_classic()+scale_color_scico_d(name= "Basin")+xlab("PC1\n2.1% of variance")+ylab("PC1\n1.7% of variance")
pca_eigs_to_plot %<>%
rowid_to_column("row_n") %>%
mutate(bsm = as.numeric(bsm)) %>%
pivot_longer(!row_n, names_to = "bsm_or_eig", values_to = "percent_variance")
ggplot(data = pca_eigs_to_plot[1:10,])+geom_bar(aes(x = as.factor(row_n), y = percent_variance, color = bsm_or_eig, fill = bsm_or_eig), stat = "identity", position=position_dodge())+theme_bw()
inset_scree <- pca_eigs_to_plot[1:16,] %>%
filter( bsm_or_eig == "pca_eig_perc") %>%
mutate(fillcolor = case_when(row_n < 3 ~ "a",
TRUE ~"b"))
b <- ggplot(data = inset_scree)+geom_bar(aes(x = as.factor(row_n), y = percent_variance, fill = fillcolor), stat = "identity", position=position_dodge())+theme_classic()+theme(legend.position = "none", axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())+scale_fill_manual(values = c( "grey15", "grey67"))
cowplot::ggdraw(a ) +
cowplot::draw_plot(b, 0.1, 0.7, .25, .25)
PCA IBD There appears to be rough relationship between latitude and PC1 score, but this may be due to Yaquina vs all other structure (Coos is furthest south yet intermediate PC scores). We will formally examine IBD using mantel tests and spatial eigenalysis later, lets just quickly plot it
ggplot(data = snp_pcs)+geom_point(aes(lat, Axis1, color = basin))+geom_smooth(aes( lat, Axis1))+theme_bw()+scale_color_scico_d()
Sub-basin structure We should also examine for sub-basin structure. Let’s compare Miami to Kilchis Rivers.
First do they vary along the primary axis of variation found among all samples (PC1)?
ggplot(data = filter(snp_pcs, basin == "Tillamook"))+geom_point(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
#ggplot(data = filter(snp_pcs, basin == "Yaquina"))+geom_point(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
No, but, these populations may vary along a different axis (i.e. sub-basin structure may not reflect a smaller degree of separation along the same genetic axis). We could search for smaller PCs where the centroid varies across these two groups, or much easier for now, just conduct their own PCA.
Xtill <- scaleGen(seppop(genind_2.0)$Tillamook, NA.method="mean")
## Warning in .local(x, ...): Some scaling values are null.
## Corresponding alleles are removed.
#then run pca
pca_till <- dudi.pca(Xtill, scale = FALSE, scannf = FALSE, nf = 320)
#check pcs to keep with kaiser-guttman
#kaiser guttman
cutoff<-mean(pca_till$eig)
kg <- length((pca_till$eig)[(pca_till$eig)>cutoff])
barplot(pca_till$eig, main = "PCA eigenvalues")
abline(h = cutoff, col = "red")
#kept all PCs
snp_pcs_till <- pca_till$li#[,c(1:kg)]
#now plot data
snp_pcs_till %<>%
rownames_to_column("sample") %>%
left_join(meta_data)
## Joining, by = "sample"
ggplot(data = snp_pcs_till)+geom_point(aes(Axis1, Axis2, color = detailed_location)) + stat_ellipse(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
ggplot(data = snp_pcs_till)+geom_point(aes(Axis1, Axis3, color = detailed_location)) + stat_ellipse(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
#
# Xyaq <- scaleGen(seppop(genind_2.0)$Yaquina, NA.method="mean")
#
#
# #then run pca
# pca_yaq <- dudi.pca(Xyaq, scale = FALSE, scannf = FALSE, nf = 320)
#
# #check pcs to keep with kaiser-guttman
#
# #kaiser guttman
# cutoff<-mean(pca_yaq$eig)
# kg <- length((pca_yaq$eig)[(pca_yaq$eig)>cutoff])
# barplot(pca_yaq$eig, main = "PCA eigenvalues")
# abline(h = cutoff, col = "red")
#
# #kept all PCs
# snp_pcs_yaq <- pca_yaq$li#[,c(1:kg)]
#
# #now plot data
# snp_pcs_yaq %<>%
# rownames_to_column("sample") %>%
# left_join(meta_data)
#
# ggplot(data = snp_pcs_yaq)+geom_point(aes(Axis1, Axis2, color = detailed_location)) + stat_ellipse(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
# ggplot(data = snp_pcs_yaq)+geom_point(aes(Axis1, Axis3, color = detailed_location)) + stat_ellipse(aes(Axis1, Axis2, color = detailed_location)) +theme_classic()+scale_color_scico_d(begin = 0.2, end = 0.8)
Nope, very little difference in group centroids along the first two PCs in either Yaquina or Tillamook basins.
PCA Results Summary
The first 33 PCs are significant according to the Kaiser-Guttman criteria, however, an ad-hoc/by eye examination suggests only the first PC is revealing any real structure in the data. This first PC largely separates Yaquina from the rest of the samples. There is a relationship between latitude and PC1 scores, suggestive of IBD, but it may be driven by the large difference between the Yaquina and other samples, especially considering that Coos River samples demonstrate intermediate PC values despite being on the extreme south of the sampling area.
Let’s examine patterns of heterozygosity and diversity next.
n.pop <- seppop(genind_2.0)
hobs <- lapply(n.pop, function(x) (summary(x)$Hobs))
hobs <- as.data.frame(t(do.call(rbind, hobs)))
hobs <- hobs %>%
rownames_to_column(var="marker")
hobs <- hobs %>%
pivot_longer(-marker, names_to = "basin", values_to = "Ho")
#ggplot(hobs)+geom_boxplot(aes(x=pop, y=hobs))+theme_classic()+xlab("Run Timing")+ylab("Observed Heterozygosity")
hexp <- lapply(n.pop, function(x) (summary(x)$Hexp))
hexp <- as.data.frame(t(do.call(rbind, hexp)))
hexp <- hexp %>%
rownames_to_column(var="marker") %>%
pivot_longer(-marker, names_to = "basin", values_to = "He")
marker_divs <- hexp %>%
left_join(hobs) %>%
pivot_longer(c("Ho", "He"), names_to = "obs_exp", values_to ="H")
H_summary <- count(genos_2.0, basin)
#plot
ggplot(data=marker_divs)+geom_boxplot(aes(x=basin, y=H, fill = obs_exp))+theme_classic()+geom_text(data = H_summary,aes(basin, Inf, label = paste0("n = ", n), vjust = 1))+scale_fill_scico_d(begin = 0.2, end = 0.8, palette ="batlow", name = "")
#table
marker_divs %>%
group_by(basin, obs_exp) %>%
summarise(mean = mean(H))
#spatial pattern?
#spatial_he <- marker_divs %>%
# group_by(basin, obs_exp) %>%
# filter(obs_exp == "He") %>%
# summarise(mean = mean(H)) %>%
# left_join(sites)
#summary(aov(data = spatial_he, mean~lat))
Significance Testing
Are these differences between populations significant? To test with the same number of loci (across populations) we’ll use a monte-carlo test and 999 permutations.
#
a <- Hs.test(n.pop$Coos, n.pop$Nehalem)
b <- Hs.test(n.pop$Coos, n.pop$Netarts)
c <- Hs.test(n.pop$Coos, n.pop$Siletz)
d <- Hs.test(n.pop$Coos, n.pop$Tillamook)
e <- Hs.test(n.pop$Coos, n.pop$Yaquina)
f <- Hs.test(n.pop$Nehalem, n.pop$Netarts)
g <- Hs.test(n.pop$Nehalem, n.pop$Siletz)
h <- Hs.test(n.pop$Nehalem, n.pop$Tillamook)
i <- Hs.test(n.pop$Nehalem, n.pop$Yaquina)
j <- Hs.test(n.pop$Netarts, n.pop$Siletz)
k <- Hs.test(n.pop$Netarts, n.pop$Tillamook)
l <- Hs.test(n.pop$Netarts, n.pop$Yaquina)
m <- Hs.test(n.pop$Siletz, n.pop$Tillamook)
n <- Hs.test(n.pop$Siletz, n.pop$Yaquina)
o <- Hs.test(n.pop$Tillamook, n.pop$Yaquina)
#now make a nice table
df <- as.data.frame(cbind(c("Coos", "Coos", "Coos", "Coos", "Coos", "Nehalem", "Nehalem","Nehalem","Nehalem", "Netarts", "Netarts", "Netarts", "Siletz", "Siletz", "Tillamook"), c("Nehalem", "Netarts", "Siletz", "Tillamook", "Yaquina", "Netarts", "Siletz", "Tillamook", "Yaquina", "Siletz", "Tillamook", "Yaquina", "Tillamook", "Yaquina", "Yaquina"), c(a$pvalue,b$pvalue,c$pvalue,d$pvalue,e$pvalue,f$pvalue, g$pvalue, h$pvalue, i$pvalue, j$pvalue, k$pvalue, l$pvalue, m$pvalue, n$pvalue, o$pvalue)))
colnames(df) <- c("pop1", "pop2","p-value")
df$p.adj <- p.adjust(df$`p-value`,n = nrow(df), method = "fdr")
kable(df)
rm(list = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "df"))
No basin pairs demonstrated significant differnce in expected heterozygosity after FDR correction.
Loci out of HWE
Are there significant departures from HWE at the loci level?
# here we use the hw.test function from pegas (exact test based on Monte Carlo permutations of alleles, 1000 permutations)
HWE.test <- lapply(n.pop, function(x) hw.test(x, B=1000))
# here we take the list of dataframes of p-values and combine into a single dataframe
hwe <- reduce(HWE.test, cbind)
hwe <- hwe[,c(4,8,12,16,20,24)]
colnames(hwe) <- c("Coos", "Nehalem", "Netarts", "Siletz", "Tillamook", "Yaquina")
hwe <- as.data.frame(hwe)
# next we correct for multiple comparisons
p.adj <- as.data.frame(apply(hwe, MARGIN = 2, function(x) p.adjust(x, "fdr")))
hwe_exceed <- p.adj %>% rownames_to_column(var="marker") %>%
pivot_longer(-marker, names_to = "basin", values_to = "fdr") %>%
filter(fdr < 0.1)
hwe_exceed <- hwe_exceed %>%
left_join(pivot_wider(marker_divs, names_from = obs_exp, values_from = H)) %>%
mutate(direction = if_else(He > Ho, "excess_homo", "excess_hetero"))
a <- hwe_exceed%>%
group_by(basin) %>%
tally()
kable(a, caption = "Number of markers significantly out of HWE")
Only 1 marker (Oke_RDDFW16781_68)was out of HW proportions in a single population (Yaquina), with less observed heterozygosity than expected.
Heterozygosity Results Summary
No populations demonstrated signifance differences in the extent of genetic diversity as measured by He. Only a single marker in a single population did not follow Hardy-Weinberg proportions.
First let’s look at some dataset-wide basic Fstats
fstat <- genind2hierfstat(genind_2.0)
colnames(fstat) <- c(pop, names(genind_2.0$loc.n.all))
basicstats <- basic.stats(fstat)
kable(basicstats$overall, caption = "All markers Fstats")
| x | |
|---|---|
| Ho | 0.3324 |
| Hs | 0.3199 |
| Ht | 0.3267 |
| Dst | 0.0067 |
| Htp | 0.3280 |
| Dstp | 0.0081 |
| Fst | 0.0206 |
| Fstp | 0.0246 |
| Fis | -0.0391 |
| Dest | 0.0119 |
Overall Fst is 0.019 and Fis is -0.04
Pairwise FST
Next let’s look at pairwise FST among basins
genet.dist(fstat, method="WC84")
## Coos Nehalem Netarts Siletz Tillamook
## Nehalem 0.019878392
## Netarts 0.031610832 0.009869960
## Siletz 0.032812760 0.017418342 0.036952394
## Tillamook 0.032519430 0.003442455 0.014195557 0.029883697
## Yaquina 0.043404971 0.013729597 0.022842561 0.018033977 0.013462716
Moderate Fst among basin at the markers in the panel. Ranging from 0.003 to 0.043
What about within a basin. Let’s compare Miami to Kilchis samples
#check order for merge
#row.names(n.pop$Tillamook$tab)== filter(genos_2.0, basin == "Tillamook") %>% pull(sample)
n.pop <- seppop(genind_2.0)
#add detailed_location as pop in genind
genind_till<- n.pop$Tillamook
genind_till@pop <- as.factor(filter(genos_2.0, basin == "Tillamook") %>% pull(detailed_location))
fstat_till <- genind2hierfstat(genind_till)
colnames(fstat_till) <- c(pop, names(genind_till$loc.n.all))
genet.dist(fstat_till, method="WC84")
## Kilchis River
## Miami River 0.003610589
basic_till <- basic.stats(fstat_till)
basic_till$overall
## Ho Hs Ht Dst Htp Dstp Fst Fstp Fis Dest
## 0.3433 0.3188 0.3194 0.0006 0.3199 0.0012 0.0018 0.0036 -0.0770 0.0017
#genind_yaq<- n.pop$Yaquina
#genind_yaq@pop <- as.factor(filter(genos_2.0, basin == "Yaquina") %>% pull(detailed_location))
#fstat_yaq <- genind2hierfstat(genind_yaq)
#colnames(fstat_yaq) <- c(pop, names(genind_yaq$loc.n.all))
#genet.dist(fstat_yaq, method="WC84")
#basic_till <- basic.stats(fstat_till)
#basic_till$overall
Very low differentiation at these markers (0.003 and 0).
site wise fst
for the spatial analysis it is probably preferable to split Kilchis and Miami, we’ll need an FST table across all sites, even better if it is in a clear order
#add detailed_location as pop in genind
genind_site<- genind_2.0
genind_site@pop <- as.factor(pull(genos_2.0, detailed_location))
fstat_site <- genind2hierfstat(genind_site)
colnames(fstat_site) <- c(pop, names(genind_site$loc.n.all))
gdist <- genet.dist(fstat_site, method="WC84")
gdist <- as.matrix(gdist)[c("Coos River", "Mill Creek", "Simpson Creek", "Bear Creek", "Whiskey Creek", "Kilchis River", "Miami River", "Nehalem River"), c("Coos River", "Mill Creek","Simpson Creek", "Bear Creek", "Whiskey Creek", "Kilchis River", "Miami River", "Nehalem River")]
gdist <- as.dist(gdist)
gdist
## Coos River Mill Creek Simpson Creek Bear Creek
## Mill Creek 4.131360e-02
## Simpson Creek 5.184011e-02 -4.146886e-05
## Bear Creek 3.259031e-02 1.779160e-02 2.925306e-02
## Whiskey Creek 4.052117e-02 2.618903e-02 2.539051e-02 4.205445e-02
## Kilchis River 2.972595e-02 1.741033e-02 2.301431e-02 3.467962e-02
## Miami River 3.188949e-02 1.370271e-02 1.748159e-02 2.520047e-02
## Nehalem River 2.664264e-02 1.378431e-02 1.679836e-02 2.185135e-02
## Whiskey Creek Kilchis River Miami River
## Mill Creek
## Simpson Creek
## Bear Creek
## Whiskey Creek
## Kilchis River 1.996319e-02
## Miami River 1.668113e-02 3.610589e-03
## Nehalem River 1.502888e-02 3.417890e-03 3.800478e-03
Marker level Fst
Now let’s look at the distribution of Fst among markers.
ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst))+theme_classic()+ggtitle("among basin Fst")
ggplot(basic_till$perloc)+geom_histogram(aes(x=Fst))+theme_classic()+ggtitle("wihin Tillamook Fst")
#check if any non-run timing markers have high Fst
#max((basicstats$perloc[basicstats$perloc$run_timing=="non-run",])$Fst, na.rm = TRUE)
Yes, potentially some outliers here in the among basin comparisons, but without annotations yet, not much of interest we can say or do with this result. It would be good to get in touch with WDFW about how they chose the markers (reports/ms describing it all cite to something unpublished).
We need a good set of spatial variables against which to examine the distribution of genetic variation. We will first generate a distance matrix based on alongshore distance, this use this distance matrix to generate Moran’s eigenvector maps for spatial eigenanalysis.
Distance Matrix
Rather than use line-of-sight for the distances among samples, we will attempt to estimate travel paths. We use mouth of the named tributary in the sample metadata. If no precise sampling location we use the mouth of the basin at the head of the bay.
#or_coast<- getNOAA.bathy(lon1 = -125, lon2 = -123, lat1 = 43, lat2 = 46, resolution = 1)
blues <- c("lightsteelblue4", "lightsteelblue3", "lightsteelblue2", "lightsteelblue1")
greys <- c(grey(0.6), grey(0.93), grey(0.99))
#downloaded a higher resolution bathymetry dataset from GEBCO (NOAA bathy had some sampling locations at >100m elevation)
or_coast <- readGEBCO.bathy("bathy/gebco_2020_n46.26621723175049_s42.488765716552734_w-124.53535079956053_e-122.70243644714354.nc")
trans<-trans.mat(or_coast, min.depth=17 , max.depth=-20)
sites <- distinct(genos_2.0, basin, detailed_location, lat, lon)
sites %<>%
arrange(lat) %>%
column_to_rownames(var = "detailed_location")
path <- lc.dist(trans, sites[,c(3,2)], res = "path")
##
|
| | 0%
|
|== | 4%
|
|===== | 7%
|
|======== | 11%
|
|========== | 14%
|
|============ | 18%
|
|=============== | 21%
|
|================== | 25%
|
|==================== | 29%
|
|====================== | 32%
|
|========================= | 36%
|
|============================ | 39%
|
|============================== | 43%
|
|================================ | 46%
|
|=================================== | 50%
|
|====================================== | 54%
|
|======================================== | 57%
|
|========================================== | 61%
|
|============================================= | 64%
|
|================================================ | 68%
|
|================================================== | 71%
|
|==================================================== | 75%
|
|======================================================= | 79%
|
|========================================================== | 82%
|
|============================================================ | 86%
|
|============================================================== | 89%
|
|================================================================= | 93%
|
|==================================================================== | 96%
|
|======================================================================| 100%
distance_matrix <- lc.dist(trans, sites[,c(3,2)], res = "dist")
#get.depth(or_coast, sites[,c(3,2)] , locator = FALSE)
distance_matrix
## Coos River Mill Creek Simpson Creek Bear Creek Whiskey Creek
## Mill Creek 143
## Simpson Creek 143 0
## Bear Creek 175 33 33
## Whiskey Creek 236 94 94 61
## Kilchis River 253 110 110 78 18
## Miami River 253 111 111 78 19
## Nehalem River 271 128 128 96 37
## Kilchis River Miami River
## Mill Creek
## Simpson Creek
## Bear Creek
## Whiskey Creek
## Kilchis River
## Miami River 10
## Nehalem River 31 27
plot(or_coast, image = TRUE, land = TRUE, lwd = 0.03, bpal = list(c(0, max(or_coast), greys),c(min(or_coast), 0, blues)), xlim = c(-125, -123), ylim = c(43,46))
points(sites[,c(3,2)], pch = 20, col = "red", cex = 2.0 )
lapply(path, lines, col = "blue", lwd = 2, lty = 1)->dummy
Bathymetric map of sampling locations with minimum path between sites constrained by deep water (gold line - 20m)
#autoplot(or_coast, geom=c("r"))+ scale_fill_gradientn(values = scales::rescale(c(-3300, -100, 1, 2000)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+
#autoplot(atlantic2, geom=c("r")) + scale_fill_gradientn(values = scales::rescale(c(-6600, -10, 1, 2000)), colors = c("steelblue4", "#C7E0FF", "grey50", "grey80"))+geom_point(data= map_data, aes(lon, lat, color = mean_annual_max), size = 4)+scale_color_viridis(option = "plasma")+xlab("Longitude") +ylab("Latitude")+guides(fill=FALSE)+labs(color = "")+theme(legend.position = c(0.9, 0.2), legend.background = element_rect(fill = "transparent"))+annotate(geom = "text", x = -69, y = 31.2, label = "Mean Annual\nMaximum Temperature", angle= 90)
dbMEMs
Now let’s make some distance based MEMs from the distance matrix.
#make the distance matrix big
sites_big <- sites %>%
rownames_to_column(var = "detailed_location") %>%
right_join(select(genos_2.0, detailed_location)) %>%
mutate(lat_jitter <- jitter(lat, amount = 0.00001)) %>%
mutate(lon_jitter <- jitter(lon, amount = 0.00001))
## Joining, by = "detailed_location"
distance_matrix_big <- lc.dist(trans, sites_big[,c(6,5)], res = "dist")
## Loading required package: gdistance
## Warning: package 'gdistance' was built under R version 3.6.2
## Loading required package: raster
## Warning: package 'raster' was built under R version 3.6.2
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.6.2
##
## Attaching package: 'raster'
## The following object is masked from 'package:magrittr':
##
## extract
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:ape':
##
## rotate, zoom
## Loading required package: igraph
## Warning: package 'igraph' was built under R version 3.6.2
##
## Attaching package: 'igraph'
## The following object is masked from 'package:raster':
##
## union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following object is masked from 'package:pegas':
##
## mst
## The following objects are masked from 'package:ape':
##
## edges, mst, ring
## The following object is masked from 'package:vegan':
##
## diversity
## The following object is masked from 'package:permute':
##
## permute
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.6.2
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'gdistance'
## The following object is masked from 'package:igraph':
##
## normalize
#dbmem <- dbmem(sites_big[,c(6,5)], MEM.autocor = "positive")
dbmem <- dbmem(distance_matrix_big, MEM.autocor = "all")
## Warning in dudi.pco(matdist, scannf = FALSE, nf = 2): Non euclidean distance
test <- moran.randtest(dbmem, nrepet = 999)
test
## class: krandtest lightkrandtest
## Monte-Carlo tests
## Call: moran.randtest(x = dbmem, nrepet = 999)
##
## Number of tests: 234
##
## Adjustment method for multiple comparisons: none
## Permutation number: 999
## Test Obs Std.Obs Alter Pvalue
## 1 MEM1 2.180908e-02 48.26223742 greater 0.001
## 2 MEM2 1.297324e-02 51.36184168 greater 0.001
## 3 MEM3 -1.004042e-32 10.44382174 greater 0.001
## 4 MEM4 -4.129828e-33 8.32322861 greater 0.002
## 5 MEM5 -7.690947e-31 9.43933776 greater 0.001
## 6 MEM6 -4.905544e-33 11.48969934 greater 0.001
## 7 MEM7 -4.818217e-33 10.42732186 greater 0.001
## 8 MEM8 -1.585746e-26 10.30593199 greater 0.001
## 9 MEM9 -5.861216e-33 11.84701135 greater 0.001
## 10 MEM10 -5.240765e-33 11.97326956 greater 0.001
## 11 MEM11 -8.260951e-33 11.95832954 greater 0.001
## 12 MEM12 -8.425447e-33 11.01256234 greater 0.001
## 13 MEM13 -8.757882e-33 11.83419166 greater 0.001
## 14 MEM14 -1.661150e-32 10.55205733 greater 0.001
## 15 MEM15 -3.827457e-32 8.30810549 greater 0.001
## 16 MEM16 -3.455235e-33 9.85999607 greater 0.001
## 17 MEM17 -2.451581e-32 10.37638282 greater 0.002
## 18 MEM18 -6.204278e-33 11.32271509 greater 0.001
## 19 MEM19 -1.454640e-32 12.49121780 greater 0.001
## 20 MEM20 -6.895926e-33 12.45806284 greater 0.001
## 21 MEM21 -5.523868e-33 10.78037310 greater 0.001
## 22 MEM22 -1.104052e-23 9.50935188 greater 0.001
## 23 MEM23 -9.063303e-33 11.58145339 greater 0.001
## 24 MEM24 -8.211850e-33 11.77584466 greater 0.001
## 25 MEM25 -1.118312e-28 10.50957286 greater 0.001
## 26 MEM26 -5.130824e-33 11.84303768 greater 0.001
## 27 MEM27 -2.041767e-29 9.36037704 greater 0.003
## 28 MEM28 -4.585437e-40 5.64373974 greater 0.017
## 29 MEM29 -1.786892e-32 11.78074843 greater 0.001
## 30 MEM30 -5.329065e-33 12.45810638 greater 0.001
## 31 MEM31 -5.020936e-33 11.18127209 greater 0.001
## 32 MEM32 -8.968120e-33 11.08547441 greater 0.001
## 33 MEM33 -7.521382e-33 11.76627489 greater 0.001
## 34 MEM34 -2.702193e-27 9.28962033 greater 0.001
## 35 MEM35 -6.289071e-33 9.27357883 greater 0.001
## 36 MEM36 -6.177393e-33 10.40160222 greater 0.001
## 37 MEM37 -1.234123e-24 10.54100979 greater 0.001
## 38 MEM38 -6.563141e-33 11.75062954 greater 0.001
## 39 MEM39 -7.781234e-33 10.70943196 greater 0.001
## 40 MEM40 -1.212609e-32 11.75474481 greater 0.001
## 41 MEM41 -3.551438e-28 11.10265082 greater 0.001
## 42 MEM42 -8.206051e-33 11.37937348 greater 0.001
## 43 MEM43 -1.437598e-32 12.17585954 greater 0.001
## 44 MEM44 -1.165436e-32 10.59094568 greater 0.001
## 45 MEM45 -8.176431e-33 9.25963576 greater 0.001
## 46 MEM46 -4.476875e-33 11.61906057 greater 0.001
## 47 MEM47 -6.532007e-33 9.01832627 greater 0.001
## 48 MEM48 -4.252111e-03 0.02215873 greater 0.476
## 49 MEM49 -4.476082e-03 -0.40673431 greater 0.760
## 50 MEM50 -4.476082e-03 -0.29561058 greater 0.812
## 51 MEM51 -4.476082e-03 -0.31367369 greater 0.769
## 52 MEM52 -4.476082e-03 -0.40658000 greater 0.751
## 53 MEM53 -4.476082e-03 -0.48403922 greater 0.718
## 54 MEM54 -4.476082e-03 -0.49640390 greater 0.745
## 55 MEM55 -4.476082e-03 -0.51360770 greater 0.721
## 56 MEM56 -4.476082e-03 -0.42126234 greater 0.733
## 57 MEM57 -4.476082e-03 -0.49884431 greater 0.705
## 58 MEM58 -4.476082e-03 -0.58220887 greater 0.743
## 59 MEM59 -4.476082e-03 -0.41849662 greater 0.753
## 60 MEM60 -4.476082e-03 -0.34402184 greater 0.773
## 61 MEM61 -4.476082e-03 -0.44970047 greater 0.712
## 62 MEM62 -4.476082e-03 -0.55404964 greater 0.712
## 63 MEM63 -4.476082e-03 -0.56301609 greater 0.736
## 64 MEM64 -4.476082e-03 -0.48764263 greater 0.737
## 65 MEM65 -4.476082e-03 -0.56219815 greater 0.729
## 66 MEM66 -4.476082e-03 -0.33088341 greater 0.779
## 67 MEM67 -4.476082e-03 -0.33696349 greater 0.764
## 68 MEM68 -4.476082e-03 -0.57002668 greater 0.725
## 69 MEM69 -4.476082e-03 -0.51875241 greater 0.707
## 70 MEM70 -4.476082e-03 -0.57371906 greater 0.741
## 71 MEM71 -4.476082e-03 -0.57428648 greater 0.720
## 72 MEM72 -4.476082e-03 -0.51274695 greater 0.725
## 73 MEM73 -4.476082e-03 -0.57566630 greater 0.723
## 74 MEM74 -4.476082e-03 -0.57533504 greater 0.739
## 75 MEM75 -4.476082e-03 -0.59104897 greater 0.731
## 76 MEM76 -4.476082e-03 -0.51649007 greater 0.730
## 77 MEM77 -4.476082e-03 -0.54428943 greater 0.758
## 78 MEM78 -4.476082e-03 -0.55494628 greater 0.725
## 79 MEM79 -4.476082e-03 -0.50192547 greater 0.722
## 80 MEM80 -4.476082e-03 -0.57959278 greater 0.749
## 81 MEM81 -4.476082e-03 -0.54455586 greater 0.724
## 82 MEM82 -4.476082e-03 -0.50583551 greater 0.734
## 83 MEM83 -4.476082e-03 -0.51924945 greater 0.698
## 84 MEM84 -4.476082e-03 -0.60152424 greater 0.730
## 85 MEM85 -4.476082e-03 -0.52880988 greater 0.714
## 86 MEM86 -4.476082e-03 -0.55896822 greater 0.706
## 87 MEM87 -4.476082e-03 -0.56382278 greater 0.729
## 88 MEM88 -4.476082e-03 -0.55742609 greater 0.718
## 89 MEM89 -4.476082e-03 -0.59167605 greater 0.725
## 90 MEM90 -4.476082e-03 -0.57131108 greater 0.745
## 91 MEM91 -4.476082e-03 -0.54269412 greater 0.706
## 92 MEM92 -4.476082e-03 -0.55353242 greater 0.728
## 93 MEM93 -4.476082e-03 -0.48418768 greater 0.717
## 94 MEM94 -4.476082e-03 -0.43436633 greater 0.743
## 95 MEM95 -4.476082e-03 -0.54121816 greater 0.739
## 96 MEM96 -4.476082e-03 -0.45303023 greater 0.716
## 97 MEM97 -4.476082e-03 -0.54846691 greater 0.753
## 98 MEM98 -4.476082e-03 -0.45810130 greater 0.737
## 99 MEM99 -4.476082e-03 -0.56382856 greater 0.758
## 100 MEM100 -4.476082e-03 -0.54945918 greater 0.727
## 101 MEM101 -4.476082e-03 -0.49386531 greater 0.747
## 102 MEM102 -4.476082e-03 -0.47277792 greater 0.740
## 103 MEM103 -4.476082e-03 -0.48177368 greater 0.725
## 104 MEM104 -4.476082e-03 -0.51589995 greater 0.702
## 105 MEM105 -4.476082e-03 -0.51683901 greater 0.708
## 106 MEM106 -4.476082e-03 -0.60927868 greater 0.745
## 107 MEM107 -4.476082e-03 -0.63875521 greater 0.740
## 108 MEM108 -4.476082e-03 -0.54860305 greater 0.705
## 109 MEM109 -4.476082e-03 -0.56267324 greater 0.727
## 110 MEM110 -4.476082e-03 -0.56043007 greater 0.711
## 111 MEM111 -4.476082e-03 -0.54293670 greater 0.716
## 112 MEM112 -4.476082e-03 -0.57728691 greater 0.721
## 113 MEM113 -4.476082e-03 -0.53917765 greater 0.732
## 114 MEM114 -4.476082e-03 -0.50550154 greater 0.716
## 115 MEM115 -4.476082e-03 -0.64011243 greater 0.745
## 116 MEM116 -4.476082e-03 -0.58913021 greater 0.737
## 117 MEM117 -4.476082e-03 -0.58407538 greater 0.732
## 118 MEM118 -4.476082e-03 -0.50919418 greater 0.700
## 119 MEM119 -4.476082e-03 -0.47004564 greater 0.731
## 120 MEM120 -4.476082e-03 -0.53153471 greater 0.726
## 121 MEM121 -4.476082e-03 -0.57989454 greater 0.727
## 122 MEM122 -4.476082e-03 -0.39393487 greater 0.737
## 123 MEM123 -4.476082e-03 -0.54812924 greater 0.713
## 124 MEM124 -4.476082e-03 -0.59540669 greater 0.736
## 125 MEM125 -4.476082e-03 -0.59445455 greater 0.736
## 126 MEM126 -4.476082e-03 -0.57415457 greater 0.733
## 127 MEM127 -4.476082e-03 -0.57986636 greater 0.723
## 128 MEM128 -4.476082e-03 -0.51186736 greater 0.722
## 129 MEM129 -4.476082e-03 -0.63207234 greater 0.737
## 130 MEM130 -4.476082e-03 -0.64264351 greater 0.758
## 131 MEM131 -4.476082e-03 -0.56517387 greater 0.715
## 132 MEM132 -4.476082e-03 -0.54171560 greater 0.715
## 133 MEM133 -4.476082e-03 -0.55725134 greater 0.743
## 134 MEM134 -4.476082e-03 -0.54467809 greater 0.726
## 135 MEM135 -4.476082e-03 -0.59758422 greater 0.726
## 136 MEM136 -4.476082e-03 -0.63578666 greater 0.752
## 137 MEM137 -4.476082e-03 -0.53029616 greater 0.717
## 138 MEM138 -4.476082e-03 -0.49959369 greater 0.721
## 139 MEM139 -4.476082e-03 -0.43153120 greater 0.725
## 140 MEM140 -4.476082e-03 -0.53944536 greater 0.713
## 141 MEM141 -4.476082e-03 -0.55284868 greater 0.722
## 142 MEM142 -4.476082e-03 -0.50343738 greater 0.722
## 143 MEM143 -4.476082e-03 -0.51952466 greater 0.695
## 144 MEM144 -4.476082e-03 -0.52753625 greater 0.734
## 145 MEM145 -4.476082e-03 -0.39875745 greater 0.735
## 146 MEM146 -4.476082e-03 -0.49080103 greater 0.729
## 147 MEM147 -4.476082e-03 -0.62202396 greater 0.739
## 148 MEM148 -4.476082e-03 -0.49636279 greater 0.716
## 149 MEM149 -4.476082e-03 -0.52286591 greater 0.713
## 150 MEM150 -4.476082e-03 -0.46536534 greater 0.733
## 151 MEM151 -4.476082e-03 -0.56592488 greater 0.721
## 152 MEM152 -4.476082e-03 -0.54333346 greater 0.723
## 153 MEM153 -4.476082e-03 -0.58285413 greater 0.733
## 154 MEM154 -4.476082e-03 -0.53843167 greater 0.733
## 155 MEM155 -4.476082e-03 -0.47226054 greater 0.687
## 156 MEM156 -4.476082e-03 -0.59079356 greater 0.738
## 157 MEM157 -4.476082e-03 -0.54178474 greater 0.719
## 158 MEM158 -4.476082e-03 -0.56391132 greater 0.727
## 159 MEM159 -4.476082e-03 -0.58587840 greater 0.742
## 160 MEM160 -4.476082e-03 -0.56224588 greater 0.735
## 161 MEM161 -4.476082e-03 -0.49334046 greater 0.736
## 162 MEM162 -4.476082e-03 -0.50194337 greater 0.743
## 163 MEM163 -4.476082e-03 -0.60987529 greater 0.730
## 164 MEM164 -4.476082e-03 -0.53263683 greater 0.721
## 165 MEM165 -4.476082e-03 -0.48606452 greater 0.710
## 166 MEM166 -4.476082e-03 -0.51284720 greater 0.712
## 167 MEM167 -4.476082e-03 -0.53301748 greater 0.723
## 168 MEM168 -4.476082e-03 -0.51363449 greater 0.728
## 169 MEM169 -4.476082e-03 -0.56331354 greater 0.710
## 170 MEM170 -4.476082e-03 -0.55741103 greater 0.734
## 171 MEM171 -4.476082e-03 -0.62619983 greater 0.741
## 172 MEM172 -4.476082e-03 -0.55559607 greater 0.688
## 173 MEM173 -4.476082e-03 -0.52247194 greater 0.735
## 174 MEM174 -4.476082e-03 -0.63063609 greater 0.747
## 175 MEM175 -4.476082e-03 -0.49070507 greater 0.756
## 176 MEM176 -4.476082e-03 -0.58074760 greater 0.731
## 177 MEM177 -4.476082e-03 -0.56884015 greater 0.719
## 178 MEM178 -4.476082e-03 -0.52462070 greater 0.733
## 179 MEM179 -4.476082e-03 -0.33622023 greater 0.766
## 180 MEM180 -4.476082e-03 -0.59376376 greater 0.733
## 181 MEM181 -4.476082e-03 -0.38187687 greater 0.730
## 182 MEM182 -4.476082e-03 -0.50087941 greater 0.720
## 183 MEM183 -4.476082e-03 -0.46830886 greater 0.754
## 184 MEM184 -4.476082e-03 -0.38518203 greater 0.783
## 185 MEM185 -4.476082e-03 -0.55178458 greater 0.730
## 186 MEM186 -4.476082e-03 -0.44199143 greater 0.713
## 187 MEM187 -4.476082e-03 -0.57432092 greater 0.744
## 188 MEM188 -4.476082e-03 -0.36383669 greater 0.817
## 189 MEM189 -4.476082e-03 -0.50118147 greater 0.752
## 190 MEM190 -4.476082e-03 -0.52658107 greater 0.733
## 191 MEM191 -4.476082e-03 -0.38117412 greater 0.782
## 192 MEM192 -4.476082e-03 -0.25344516 greater 0.839
## 193 MEM193 -4.476082e-03 -0.50142252 greater 0.761
## 194 MEM194 -4.476082e-03 -0.51281883 greater 0.737
## 195 MEM195 -4.476082e-03 -0.47833331 greater 0.724
## 196 MEM196 -4.476887e-03 -0.58088067 greater 0.837
## 197 MEM197 -7.080352e-03 -8.58125263 greater 1.000
## 198 MEM198 -8.585216e-03 -10.17605521 greater 1.000
## 199 MEM199 -8.788524e-03 -12.80881095 greater 1.000
## 200 MEM200 -8.952164e-03 -10.13740200 greater 1.000
## 201 MEM201 -8.952164e-03 -10.65956382 greater 1.000
## 202 MEM202 -8.952164e-03 -11.34549626 greater 1.000
## 203 MEM203 -8.952164e-03 -11.99827292 greater 1.000
## 204 MEM204 -8.952164e-03 -12.39148243 greater 1.000
## 205 MEM205 -8.952164e-03 -10.43699072 greater 1.000
## 206 MEM206 -8.952164e-03 -13.25709023 greater 1.000
## 207 MEM207 -8.952164e-03 -11.86778392 greater 1.000
## 208 MEM208 -8.952164e-03 -11.59599426 greater 1.000
## 209 MEM209 -8.952164e-03 -12.81562868 greater 1.000
## 210 MEM210 -8.952164e-03 -12.37648064 greater 1.000
## 211 MEM211 -8.952164e-03 -12.39842844 greater 1.000
## 212 MEM212 -8.952164e-03 -12.31332037 greater 1.000
## 213 MEM213 -8.952164e-03 -10.26426717 greater 1.000
## 214 MEM214 -8.952164e-03 -11.99523107 greater 1.000
## 215 MEM215 -8.952164e-03 -12.95867867 greater 1.000
## 216 MEM216 -8.952164e-03 -12.15675319 greater 1.000
## 217 MEM217 -8.952164e-03 -13.33418445 greater 1.000
## 218 MEM218 -8.952164e-03 -9.61220349 greater 1.000
## 219 MEM219 -8.952164e-03 -10.84503957 greater 1.000
## 220 MEM220 -8.952164e-03 -10.39558523 greater 1.000
## 221 MEM221 -8.952164e-03 -12.69001350 greater 1.000
## 222 MEM222 -8.952164e-03 -10.92630309 greater 1.000
## 223 MEM223 -8.952164e-03 -10.97539290 greater 1.000
## 224 MEM224 -8.952164e-03 -12.54690144 greater 1.000
## 225 MEM225 -8.952164e-03 -10.37432609 greater 1.000
## 226 MEM226 -8.952164e-03 -12.42151382 greater 1.000
## 227 MEM227 -8.952488e-03 -9.12623944 greater 1.000
## 228 MEM228 -1.279930e-02 -23.92813329 greater 1.000
## 229 MEM229 -1.307024e-02 -22.72302117 greater 1.000
## 230 MEM230 -1.337145e-02 -23.80413773 greater 1.000
## 231 MEM231 -1.342825e-02 -22.46710842 greater 1.000
## 232 MEM232 -1.342825e-02 -22.03931913 greater 1.000
## 233 MEM233 -1.342825e-02 -18.75951067 greater 1.000
## 234 MEM234 -1.342847e-02 -16.21204844 greater 1.000
barplot(attr(dbmem, "values"),
main = "Eigenvalues of the spatial weighting matrix\nProportional to Moran's I\nFull Dataset", cex.main = 0.7)
dbmem <- dbmem(distance_matrix_big, MEM.autocor = "positive")
## Warning in dudi.pco(matdist, scannf = FALSE, nf = 2): Non euclidean distance
dbmem_sites <- as.data.frame(cbind(dbmem, sites_big))
dbmem_sites_summary <- dbmem_sites %>%
group_by(detailed_location) %>%
summarise(mem1 = mean(MEM1), mem2 = mean(MEM2))
map_data %<>%
left_join(dbmem_sites_summary)
## Joining, by = "detailed_location"
a <- ggmap(map) + geom_point(data = map_data, aes(x = lon, y = lat, color = mem1), size = 2)+ geom_text(data = distinct(map_data, basin, .keep_all = TRUE), aes(label = basin ), hjust = 1.2)+scale_color_scico(palette = "batlow")+ggtitle("MEM1")+xlab("")+ylab("")
b <- ggmap(map) + geom_point(data = map_data, aes(x = lon, y = lat, color = mem2), size = 2)+ geom_text(data = distinct(map_data, basin, .keep_all = TRUE), aes(label = basin ), hjust = 1.2)+scale_color_scico(palette = "batlow", name = "")+ggtitle("MEM2")
cowplot::plot_grid(a,b)
all MEMs with positive moran’s I (e.g. autocorrelation) are significant. Keeping the first two. MEM1 largely captures the big distance between Coos Bay and all the other sampling sites. MEM2 is effectively distance away from Yaquina and captures spatial autocorrelation at a finer scale.
Here we examine the correlation between spatial distance and linearized FST to test for IBD with a simple Mantel test
gdist_m <- as.matrix(gdist)
gdist_long <- data.frame(col=colnames(gdist_m)[col(gdist_m)], row=rownames(gdist_m)[row(gdist_m)], dist=c(gdist_m))
gdist_long %<>%
filter(dist != 0) %>%
rename(fst = dist) %>%
mutate(lin_fst = fst/(1-fst))
gdist_linear <- as.dist(xtabs(gdist_long[, 4] ~ gdist_long[, 2] + gdist_long[, 1]))
gdist_linear
## Bear Creek Coos River Kilchis River Miami River
## Coos River 3.368822e-02
## Kilchis River 3.592550e-02 3.063666e-02
## Miami River 2.585195e-02 3.293993e-02 3.623673e-03
## Mill Creek 1.811388e-02 4.309397e-02 1.771882e-02 1.389308e-02
## Nehalem River 2.233950e-02 2.737190e-02 3.429612e-03 3.814976e-03
## Simpson Creek 3.013459e-02 5.467444e-02 2.355644e-02 1.779263e-02
## Whiskey Creek 4.390067e-02 4.223248e-02 2.036984e-02 1.696411e-02
## Mill Creek Nehalem River Simpson Creek
## Coos River
## Kilchis River
## Miami River
## Mill Creek
## Nehalem River 1.397697e-02
## Simpson Creek -4.146714e-05 1.708537e-02
## Whiskey Creek 2.689334e-02 1.525820e-02 2.605198e-02
#reorder
gdist_linear <- as.matrix(gdist_linear)[c("Coos River", "Mill Creek","Simpson Creek", "Bear Creek", "Whiskey Creek", "Kilchis River", "Miami River", "Nehalem River"), c("Coos River", "Mill Creek", "Simpson Creek","Bear Creek", "Whiskey Creek", "Kilchis River", "Miami River", "Nehalem River")]
gdist_linear <- as.dist(gdist_linear)
mt <- mantel(gdist_linear, distance_matrix) #999 permutation, pearson correlation
mt
##
## Mantel statistic based on Pearson's product-moment correlation
##
## Call:
## mantel(xdis = gdist_linear, ydis = distance_matrix)
##
## Mantel statistic r: 0.5394
## Significance: 0.044
##
## Upper quantiles of permutations (null model):
## 90% 95% 97.5% 99%
## 0.437 0.526 0.593 0.625
## Permutation: free
## Number of permutations: 999
geo_dist_long <- data.frame(col=colnames(as.matrix(distance_matrix))[col(as.matrix(distance_matrix))], row=rownames(as.matrix(distance_matrix))[row(as.matrix(distance_matrix))], dist=c(as.matrix(distance_matrix)))
geo_dist_long %<>%
filter(dist != 0)
dist_long <- left_join(gdist_long, geo_dist_long)
## Joining, by = c("col", "row")
ggplot(dist_long)+geom_point(aes(dist, lin_fst))+geom_smooth(aes(dist, lin_fst), method = "lm", se = FALSE)+theme_classic()+xlab("Alongshore Distance")+ylab(expression(paste("F"[ST],"/(1-F"[ST], ")")))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
Mantel Test is marginally significant (p 0.044), and highly explanatory (mantel r statistic = 0.557) suggesting that genetic differentiation among sampling locations is in part driven by IBD.
snp_rda_null <- rda(X ~ 1, data = dbmem)
snp_rda_full <- rda(X ~ . , data = dbmem)
#check that the full model is significant
anova(snp_rda_full) # 0.002 yes it is significant - proceed to variable selection
#what the variance explained
adjR2.rda_full <- RsquareAdj(snp_rda_full)$adj.r.squared
adjR2.rda_full
## [1] 0.009823535
snp_rda_final <- rda(X ~ MEM1 + MEM2 , data = dbmem)
#variable selection
# here we do variable selection with three different forward selection approaches
ordiR2 <- ordiR2step(snp_rda_null, scope = formula(snp_rda_full, data = dbmem))
## Step: R2.adj= 0
## Call: X ~ 1
##
## R2.adjusted
## <All variables> 0.009823535
## + MEM2 0.006952074
## + MEM1 0.002829300
## <none> 0.000000000
##
## Df AIC F Pr(>F)
## + MEM2 1 1519.5 2.6382 0.002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: R2.adj= 0.006952074
## Call: X ~ MEM2
##
## R2.adjusted
## <All variables> 0.009823535
## + MEM1 0.009823535
## <none> 0.006952074
##
## Df AIC F Pr(>F)
## + MEM1 1 1519.8 1.6757 0.002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: R2.adj= 0.009823535
## Call: X ~ MEM2 + MEM1
##
ordiR2$anova
ordi <- ordistep(snp_rda_null, scope = formula(snp_rda_full, data = dbmem))
##
## Start: X ~ 1
##
## Df AIC F Pr(>F)
## + MEM2 1 1519.5 2.6382 0.005 **
## + MEM1 1 1520.5 1.6639 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: X ~ MEM2
##
## Df AIC F Pr(>F)
## - MEM2 1 1520.2 2.6382 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Df AIC F Pr(>F)
## + MEM1 1 1519.8 1.6757 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: X ~ MEM2 + MEM1
##
## Df AIC F Pr(>F)
## - MEM1 1 1519.5 1.6757 0.005 **
## - MEM2 1 1520.5 2.6458 0.005 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ordi$anova
vif.cca(snp_rda_full)
## MEM1 MEM2
## 1 1
The global model is significant (p < .001) with an adjusted R2 of 0.010. Both variable selection procedures suggest retaining MEM1 and MEM2
#testing signficance of constrained axes (how many are interesting)
rda_axis <- anova.cca(snp_rda_final, by = 'axis', parallel = 3)
rda_axis$p.adj <- p.adjust (rda_axis$`Pr(>F)`, method = 'fdr')
rda_axis
#testing significance of explanatory variables by margin (each conditioned on all the others)
rda_expl <- anova.cca(snp_rda_final, by = 'margin', parallel = 3)
rda_expl$p.adj <- p.adjust (rda_expl$`Pr(>F)`, method = 'holm')
rda_expl$pve<- rda_expl$Variance/sum(rda_expl$Variance)
rda_expl
## tri plot
#site score
rda_sum<-summary(snp_rda_final, axes = c(2))
scores <- as.data.frame(rda_sum$sites)
scores$pop<-genind_2.0$pop
#species scores
arrows<-as.data.frame(rda_sum$biplot)
#plot by pop
ggplot()+geom_point(aes(RDA1, RDA2, color = pop), alpha = 0.7, data = scores)+geom_segment(data=arrows, aes(x=0, y=0, xend=RDA1, yend=RDA2), arrow=arrow(length=unit(0.2,"cm")), color="black")+geom_text(data=arrows, aes(x=RDA1+RDA1*0.4, y=RDA2+RDA2*0.4, label=c("MEM1", "MEM2")), size = 4, color="black")+stat_ellipse(aes(RDA1, RDA2, color = pop), data = scores)+theme_classic()+xlab("Redundant Axis 1\n1.1% of Total Variance")+ylab("Redundant Axis 2\n0.7% of Total Variance")+scale_color_discrete()+scale_color_scico_d()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Too few points to calculate an ellipse
## Too few points to calculate an ellipse
## Warning: Removed 2 row(s) containing missing values (geom_path).
#variable loadings
snp_loading <- as.data.frame(scores(snp_rda_final, display = "species"))
ggplot(data = snp_loading)+geom_histogram(aes(x =scale(RDA1)))+theme_classic()+ggtitle(paste("Full Pacific RDA 1\n",sum(abs(scale(snp_loading$RDA1)) > 3)/2, sep = "" ))+xlab("Z-scores of SNP loadings")+geom_vline(aes(xintercept = 3), color = "red")+geom_vline(aes(xintercept = -3), color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#count the number of outliers
sum(abs(scale(snp_loading$RDA1)) > 3)/2
## [1] 0
#which(scale(snp_loading$RDA1) > 3)
#[1] 275 364
#> snp_loading[275,]
# RDA1 PC1
#Oke_RDDFW29994_56.A 0.2364937 -0.06317717
# > snp_loading[364,]
# RDA1 PC1
# Oke_RDDFW61351_74.G 0.305528 -0.232033
Both RDA axes are significant and explain 1.1% and 0.7 of total variance respectively. Collectively, the two RDAs constrained ~1.82% of variance, given that the global FST in the dataset was 0.0206, this means we were able to explain 89% of among population variation as spatial autocorrelation among samples. The primary axis of constrained variation (RDA1) was driven by MEM2 which largely captures distance away from Yaquina. The second axis was driven mostly by mem1 and separates Coos Bay samples from all others.
Here we use a bayesian, model based clustering algorithm (STRUCTURE) to infer population structure and estimate admixture proportions of individual samples.
First we need to get our dataset ready for structure: remove linked loci, convert to structure format.
# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport <- dartR::gl.report.ld(dartR::gi2gl(genind_2.0, verbose = 0), name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL, verbose = 0)
## Processing a SNP dataset
## Start conversion....
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using >save(object, file="object.rdata")
We’ll prune (keep one) the dataset of any locus-pairs with r2 > 0.2, then convert to STRUCTURE format
unlinked_genind <- genind_2.0[loc=-unique(ldreport[ldreport$R2>0.2,]$loc2)]
rm(ldreport)
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid) then convert data to integers
df <- genind2df(unlinked_genind)
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/all.str.tmp")
#do stuff here
# Coos 1
# Yaquina 2
# Nehalem 3
# Netarts 4
# Siletz 5
# Tillamook 6
df <- read_tsv("genotype_data/all.str.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/all.str", col_names = FALSE)
Removed 7 high LD loci
Structure was run in a GUI outside this computation notebook’s environment.
admixture model: admixture, with correlated allele frequency
burnin/mcmc: ran with k=1-6 for 100k iteration to check for convergence used 20,000/40,000 for final runs
replicates: did 10 replicates for k=1-6
Best K was chosen by the evanno method, and estimated in structure harvester.
Replicate results within each K were clumpp’d using the clumpak algorithm on the clumpak webserver
Here we visualize the structure results of the clumpp’d results of all K values
Best K
Best K was 2 according to the Evanno (delta K) method, however, it’s important to remember the bias toward k=2 when differentiation is low or there is no population structure using this method. Delta k literally cannot evaluate K=1. (see note below)
note: delta K suggests best k is two, however, particularly at low differentiation, the delta k method is biased towards k=2 (cullingham 2020), seems like a good place to remind myself that k is a model that doesn’t always fully catpure biological reality and comparing results across different levels of k can proide interesting insights, even when best k is unknown, particularly in the case of low differentiation.
Structure Plots
Next let’s take the clumppd results and make some publication-ready figures.
#import clump results into dataframes
# results are in files k*/majorcluster/clumppfiles/clumpindoutput
# took these files and captured the relevant data with a text editor (original input files are a mess with multiple field separators) and saved to new files
k2 <- read_tsv("./structure/formatted_results/k2.txt")
k3 <- read_tsv("./structure/formatted_results/k3.txt")
k4 <- read_tsv("./structure/formatted_results/k4.txt")
k5 <- read_tsv("./structure/formatted_results/k5.txt")
k6 <- read_tsv("./structure/formatted_results/k6.txt")
levels_pop <- c("Nehalem", "Tillamook", "Netarts", "Siletz", "Yaquina", "Coos")
k2 %<>%
mutate(pop = factor(pop, levels = levels_pop)) %>%
arrange(pop)
k3 %<>%
mutate(pop = factor(pop, levels = levels_pop)) %>%
arrange(pop)
k4 %<>%
mutate(pop = factor(pop, levels = levels_pop)) %>%
arrange(pop)
k5 %<>%
mutate(pop = factor(pop, levels = levels_pop)) %>%
arrange(pop)
k6 %<>%
mutate(pop = factor(pop, levels = levels_pop)) %>%
arrange(pop)
plot_data <- k2 %>%
rownames_to_column(var="id") %>%
# sample the half_pounder and fall fish to smaller size for plot
gather('cluster', 'prob', clust1:clust2) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)]) %>%
mutate(
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0), labels = NULL) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(),axis.ticks.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank())+
scale_fill_scico_d()
plot_data <- k3 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust3) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0), labels = NULL) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(),axis.ticks.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_scico_d()
plot_data <- k4 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust4) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0), labels = NULL) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), axis.ticks.y=element_blank(),strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
scale_fill_scico_d()
plot_data <- k5 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust5) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0), labels = NULL) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(),axis.ticks.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank())+scale_fill_scico_d()
plot_data <- k6 %>%
rownames_to_column(var="id") %>%
gather('cluster', 'prob', clust1:clust6) %>%
group_by(id) %>%
mutate(likely_assignment = cluster[which.max(prob)],
assingment_prob = max(prob)) %>%
arrange(likely_assignment, desc(assingment_prob)) %>%
ungroup()
e <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
geom_col(width=1.0) +
facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
scale_y_continuous(expand = c(0, 0), labels = NULL) +
scale_x_discrete(expand = expand_scale(add = 1)) +
theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(),axis.ticks.y=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_text(angle = 90)) +
scale_fill_scico_d()
cowplot::plot_grid(a,b,c,d,e, rel_heights = c(1,1,1,1,2.0) ,ncol=1)
Do we have enough data and is there sufficient differentiation to assign individuals back to their basin of origin? Let’s run rubias to check.
# we need to make some changes to the dataset (nucleotide to integer, once column per allele etc)
df <- genind2df(unlinked_genind)
df %<>%
rownames_to_column("indiv")
write_tsv(df, "genotype_data/unlinked_raw.txt")
#write_tsv(rogue_baseline, "rogue_baseline.txt")
#made changes using regex in a text editor
# first split columns of genotype data
# then duplicated headers: find: ([a-zA-Z0-9\-_]+), replace: \1\t\1_1
baseline <- read_tsv("GSI/baseline.txt")
baseline %<>%
mutate(across(everything(), as.character)) %>%
mutate(collection = repunit) %>%
relocate(collection , .after = repunit)
#actually, no, let's put the colelction in there correctly
baseline %<>%
left_join(select(genos_2.0, sample, detailed_location), by = c("indiv" = "sample")) %>%
mutate(collection = detailed_location) %>%
select(-c(detailed_location))
sa <- self_assign(reference = baseline, gen_start_col = 5)
## Summary Statistics:
##
## 235 Individuals in Sample
##
## 318 Loci: Oke_ACOT_100, Oke_AhR1_278, Oke_APOB_60, Oke_arf_319, Oke_ATP5L_105, Oke_brd2_118, Oke_brp16_65, Oke_CATB_60, Oke_ccd16_77, Oke_CD123_62, Oke_cjo57_86, Oke_CKS_389, Oke_CKS1_70, Oke_CO1A1_72, Oke_ctgf_105, Oke_CTR2_82, Oke_DBLOH_79, Oke_DCXR_87, Oke_DM20_548, Oke_e2ig5_50, Oke_EF2_394, Oke_eif4ebp2_64, Oke_eif4g1_43, Oke_f5_71, Oke_glrx1_78, Oke_GNMT_100, Oke_GPDH_191, Oke_H2AX_72, Oke_HP_182, Oke_hsc71_199, Oke_il_1racp_67, Oke_IL8r_272, Oke_IL8r2_406, Oke_lactb2_71, Oke_LAMP2_186, Oke_mcfd2_86, Oke_METK2_97, Oke_mgll_49, Oke_MLRN_63, Oke_nc2b_148, Oke_ndub3_58, Oke_PDIA3_082, Oke_pnrc2_78, Oke_PPA2_635, Oke_psmd9_057, Oke_psmd9_188, Oke_rab5a_117, Oke_RAD10028_44, Oke_RAD10173_41, Oke_RAD10459_85, Oke_RAD10591_67, Oke_RAD10676_50, Oke_RAD10719_31, Oke_RAD11183_63, Oke_RAD11379_85, Oke_RAD11444_75, Oke_RAD11500_80, Oke_RAD11690_33, Oke_RAD11918_57, Oke_RAD11928_30, Oke_RAD11999_36, Oke_RAD12038_34, Oke_RAD12294_71, Oke_RAD12377_41, Oke_RAD12415_71, Oke_RAD12909_51, Oke_RAD13522_47, Oke_RAD14303_73, Oke_RAD14679_56, Oke_RAD14962_45, Oke_RAD16205_61, Oke_RAD1635_77, Oke_RAD16763_78, Oke_RAD16805_31, Oke_RAD17085_70, Oke_RAD17316_60, Oke_RAD17332_67, Oke_RAD19121_72, Oke_RAD19883_47, Oke_RAD21307_43, Oke_RAD2158_44, Oke_RAD22662_43, Oke_RAD2414_54, Oke_RAD24191_34, Oke_RAD2522_75, Oke_RAD27616_72, Oke_RAD2772_76, Oke_RAD2812_43, Oke_RAD2827_56, Oke_RAD3131_64, Oke_RAD3143_30, Oke_RAD3223_33, Oke_RAD3480_76, Oke_RAD3664_73, Oke_RAD369_38, Oke_RAD3693_30, Oke_RAD3715_76, Oke_RAD3762_78, Oke_RAD3861_38, Oke_RAD3907_37, Oke_RAD3938_71, Oke_RAD3995_44, Oke_RAD4286_77, Oke_RAD4538_78, Oke_RAD4787_63, Oke_RAD4875_62, Oke_RAD5156_68, Oke_RAD5276_66, Oke_RAD5434_50, Oke_RAD5457_49, Oke_RAD5615_34, Oke_RAD5734_46, Oke_RAD5951_44, Oke_RAD7067_53, Oke_RAD715_49, Oke_RAD7431_40, Oke_RAD7512_33, Oke_RAD7883_49, Oke_RAD7936_41, Oke_RAD8018_38, Oke_RAD8326_32, Oke_RAD8335_79, Oke_RAD8372_31, Oke_RAD8698_82, Oke_RAD8799_30, Oke_RAD8814_56, Oke_RAD8930_48, Oke_RAD905_85, Oke_RAD9273_49, Oke_RAD9447_73, Oke_ras1_249, Oke_RDDFW10186_42, Oke_RDDFW112_17, Oke_RDDFW13161_74, Oke_RDDFW13803_76, Oke_RDDFW14498_66, Oke_RDDFW14591_59, Oke_RDDFW14903_53, Oke_RDDFW15717_46, Oke_RDDFW16781_68, Oke_RDDFW16886_37, Oke_RDDFW17423_60, Oke_RDDFW17478_56, Oke_RDDFW18411_78, Oke_RDDFW19534_51, Oke_RDDFW19665_62, Oke_RDDFW19805_61, Oke_RDDFW19807_50, Oke_RDDFW19817_73, Oke_RDDFW19905_31, Oke_RDDFW20179_61, Oke_RDDFW20811_77, Oke_RDDFW20999_31, Oke_RDDFW21285_59, Oke_RDDFW21890_73, Oke_RDDFW23444_56, Oke_RDDFW24615_60, Oke_RDDFW24690_58, Oke_RDDFW25476_76, Oke_RDDFW25539_63, Oke_RDDFW27843_66, Oke_RDDFW28182_57, Oke_RDDFW28560_20, Oke_RDDFW29139_38, Oke_RDDFW29994_56, Oke_RDDFW30681_49, Oke_RDDFW32214_64, Oke_RDDFW36545_52, Oke_RDDFW36920_78, Oke_RDDFW37304_51, Oke_RDDFW37547_71, Oke_RDDFW37906_35, Oke_RDDFW37977_64, Oke_RDDFW38189_70, Oke_RDDFW38998_48, Oke_RDDFW39056_71, Oke_RDDFW41741_28, Oke_RDDFW41894_64, Oke_RDDFW41968_67, Oke_RDDFW44691_36, Oke_RDDFW45450_75, Oke_RDDFW46094_46, Oke_RDDFW4669_39, Oke_RDDFW46886_62, Oke_RDDFW47500_38, Oke_RDDFW47763_68, Oke_RDDFW47838_76, Oke_RDDFW48220_71, Oke_RDDFW48558_66, Oke_RDDFW48576_24, Oke_RDDFW48705_55, Oke_RDDFW49372_62, Oke_RDDFW50553_28, Oke_RDDFW51270_33, Oke_RDDFW51705_44, Oke_RDDFW52567_35, Oke_RDDFW52773_67, Oke_RDDFW53380_67, Oke_RDDFW57028_36, Oke_RDDFW57213_78, Oke_RDDFW58551_52, Oke_RDDFW58967_48, Oke_RDDFW59084_41, Oke_RDDFW6089_75, Oke_RDDFW61097_36, Oke_RDDFW61270_26, Oke_RDDFW61351_74, Oke_RDDFW61394_55, Oke_RDDFW62184_66, Oke_RDDFW63326_23, Oke_RDDFW63766_37, Oke_RDDFW64078_73, Oke_RDDFW64969_46, Oke_RDDFW65679_48, Oke_RDDFW65809_61, Oke_RDDFW65817_59, Oke_RDDFW6606_46, Oke_RDDFW66307_18, Oke_RDDFW66740_41, Oke_RDDFW67033_73, Oke_RDDFW6714_54, Oke_RDDFW67894_45, Oke_RDDFW68590_61, Oke_RDDFW6872_56, Oke_RDDFW69367_28, Oke_RDDFW69501_76, Oke_RDDFW69741_55, Oke_RDDFW69778_62, Oke_RDDFW69792_54, Oke_RDDFW70556_28, Oke_RDDFW70829_40, Oke_RDDFW70902_20, Oke_RDDFW71171_55, Oke_RDDFW71319_78, Oke_RDDFW71747_19, Oke_RDDFW71866_73, Oke_RDDFW72082_30, Oke_RDDFW72091_54, Oke_RDDFW72178_38, Oke_RDDFW72464_65, Oke_RDDFW72515_22, Oke_RDDFW72580_61, Oke_RDDFW73626_31, Oke_RDDFW73828_41, Oke_RDDFW74107_70, Oke_RDDFW75268_70, Oke_RDDFW75348_40, Oke_RDDFW76878_51, Oke_RDDFW78157_36, Oke_RDDFW78789_24, Oke_RDDFW80059_43, Oke_RDDFW80154_68, Oke_RDDFW80662_52, Oke_RDDFW89357_49, Oke_RDDFW89443_55, Oke_RDDFW90259_25, Oke_RDDFW90437_74, Oke_RDDFW9156_37, Oke_RDDFW945_50, Oke_RDDFW94768_76, Oke_RDDFW9886_52, Oke_RH1op_245, Oke_RS9_379, Oke_RSPRY1_106, Oke_slc1a3a_86, Oke_sylc_90, Oke_TCP1_78, Oke_TCTA_99, Oke_thic_84, Oke_txnrd1_74, Oke_u1_519, Oke_U1002_165, Oke_U1008_83, Oke_U1015_255, Oke_U1017_52, Oke_U1019_218, Oke_U1020_75, Oke_U1023_147, Oke_U1024_113, Oke_u200_385, Oke_U2001_629, Oke_U2005_62, Oke_U2006_109, Oke_U2010_94, Oke_U2011_107, Oke_U2016_118, Oke_U2017_87, Oke_U2019_112, Oke_U2020_51, Oke_U2022_101, Oke_U2024_93, Oke_U2025_86, Oke_U2026_64, Oke_U2029_79, Oke_U2031_37, Oke_U2032_74, Oke_U2033_122, Oke_U2034_55, Oke_U2035_54, Oke_U2037_76, Oke_U2041_84, Oke_U2043_51, Oke_U2045_43, Oke_U2048_91, Oke_U2049_99, Oke_U2053_60, Oke_U2054_58, Oke_U2056_90, Oke_U2057_80, Oke_u216_222, Oke_U305_307, Oke_U504_228, Oke_U506_110, Oke_U507_087, Oke_U507_286, Oke_U509_219, Oke_U511_271, Oke_U514_150
##
## 6 Reporting Units: Coos, Yaquina, Nehalem, Netarts, Siletz, Tillamook
##
## 8 Collections: Coos River, Mill Creek, Nehalem River, Whiskey Creek, Bear Creek, Kilchis River, Miami River, Simpson Creek
##
## 1.64% of allelic data identified as missing
#summarise by reporting unit
sa_to_repu <- sa %>%
group_by(indiv, collection, repunit, inferred_repunit) %>%
summarise(repu_scaled_like = sum(scaled_likelihood))
# for each individual, assign to most likely reporting unit
sa_assign <- sa_to_repu %>%
group_by(indiv) %>%
slice_max(repu_scaled_like)
sa_assign$correct_assignment <- sa_assign$repunit == sa_assign$inferred_repunit
sum(sa_assign$correct_assignment/nrow(sa_assign))
## [1] 0.7106383
sa_assign %>%
group_by(repunit) %>%
summarise(correct_assignment_rate = sum(correct_assignment ==TRUE)/n(), sample_size = n())
LOO Self-assignment assigns (maximum scaled likelihood) reference individuals back to correct reporting unit (basin) 71% of time, however there was a high degree of variance among basins, with the assignment rate highly dependent on sample size.
Let’s do a similar analysis on a simulated mixture.
Here we conduct a 500 simulations of a mixture of 200 samples drawn at equal rates from the reporting units.
ref_sims_no_prior <- assess_reference_loo(reference = baseline,
gen_start_col = 5,
reps = 500,
mixsize = 200,
)
tmp <- ref_sims_no_prior %>%
group_by(iter, repunit) %>%
summarise(true_repprop = sum(true_pi),
reprop_posterior_mean = sum(post_mean_pi),
repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
ggplot(tmp, aes(x = true_repprop, y = reprop_posterior_mean, colour = repunit)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
facet_wrap(~ repunit)+scale_color_scico_d()
Using equal priors acorss the two reporting units the true vs inferred mixture proprotions (above) show higly variable residuals in estimation of mixing proportion. We tend to assume more fish in the mixture are from systems with high samples sizes and underassign to those with low
Now let’s check how reliable an individual assignment (posterior probability) is from these simulations.
ref_sims_no_prior_indivs <- assess_reference_loo(reference = baseline,
gen_start_col = 5,
reps = 500,
mixsize = 200,
return_indiv_posteriors = TRUE)
# summarise things
repu_pofzs <- ref_sims_no_prior_indivs$indiv_posteriors %>%
filter(repunit == simulated_repunit) %>%
group_by(iter, indiv, simulated_collection, repunit) %>% # first aggregate over reporting units
summarise(repu_PofZ = sum(PofZ)) %>%
ungroup() %>%
arrange(repunit, simulated_collection) %>%
mutate(simulated_collection = factor(simulated_collection, levels = unique(simulated_collection)))
#> `summarise()` regrouping output by 'iter', 'indiv', 'simulated_collection' (override with `.groups` argument)
# also get the number of simulated individuals from each collection
num_simmed <- ref_sims_no_prior_indivs$indiv_posteriors %>%
group_by(iter, indiv) %>%
slice(1) %>%
ungroup() %>%
count(simulated_collection)
# note, the last few steps make simulated collection a factor so that collections within
# the same repunit are grouped together in the plot.
# now, plot it
ggplot(repu_pofzs, aes(x = simulated_collection, y = repu_PofZ)) +
geom_boxplot(aes(colour = repunit)) +
geom_text(data = num_simmed, mapping = aes(y = 1.025, label = n), angle = 90, hjust = 0, vjust = 0.5, size = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 9, vjust = 0.5)) +
ylim(c(0, 1.25))+scale_color_scico_d()+theme_classic()
This follows the result above. Very low assignment accuracy at the individual level at all but Tillamook and Yaquina.
Is fst = 0.002 sufficient for GSI within a basin? Let’s check
coll_pofzs <- ref_sims_no_prior_indivs$indiv_posteriors %>%
filter(collection == simulated_collection) %>%
arrange(repunit, simulated_collection) %>%
mutate(simulated_collection = factor(simulated_collection, levels = unique(simulated_collection)))
#> `summarise()` regrouping output by 'iter', 'indiv', 'simulated_collection' (override with `.groups` argument)
# also get the number of simulated individuals from each collection
num_simmed <- ref_sims_no_prior_indivs$indiv_posteriors %>%
group_by(iter, indiv) %>%
slice(1) %>%
ungroup() %>%
count(simulated_collection)
# note, the last few steps make simulated collection a factor so that collections within
# the same repunit are grouped together in the plot.
# now, plot it
ggplot(coll_pofzs, aes(x = simulated_collection, y = PofZ)) +
geom_boxplot(aes(colour = simulated_collection)) +
geom_text(data = num_simmed, mapping = aes(y = 1.025, label = n), angle = 90, hjust = 0, vjust = 0.5, size = 3) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 9, vjust = 0.5)) +
ylim(c(0, 1.25))+scale_color_scico_d()
No, pretty poor assignment to Kilchis and Miami.
In the panel optimization run, several samples failed to genotype. These were included in this panel as well to determine if the failure to genotype was due to sample quality or other issues.
# import data from test plates
test_run_genos <- read_csv("previous_run_genotype_data/Oke_GTs_0.1.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## `Raw Reads` = col_double(),
## `On-Target Reads` = col_double(),
## `%On-Target` = col_double(),
## `%GT` = col_double(),
## IFI = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 95 parsing failures.
## row col expected actual file
## 1 -- 356 columns 357 columns 'previous_run_genotype_data/Oke_GTs_0.1.csv'
## 2 -- 356 columns 357 columns 'previous_run_genotype_data/Oke_GTs_0.1.csv'
## 3 -- 356 columns 357 columns 'previous_run_genotype_data/Oke_GTs_0.1.csv'
## 4 -- 356 columns 357 columns 'previous_run_genotype_data/Oke_GTs_0.1.csv'
## 5 -- 356 columns 357 columns 'previous_run_genotype_data/Oke_GTs_0.1.csv'
## ... ... ........... ........... ............................................
## See problems(...) for more details.
# import unfiltered genotype data
raw_genos <- read_csv("genotype_data/Oke_2021_coastal_genos_0.1.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## `Raw Reads` = col_double(),
## `On-Target Reads` = col_double(),
## `%On-Target` = col_double(),
## `%GT` = col_double(),
## IFI = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 383 parsing failures.
## row col expected actual file
## 1 -- 356 columns 357 columns 'genotype_data/Oke_2021_coastal_genos_0.1.csv'
## 2 -- 356 columns 357 columns 'genotype_data/Oke_2021_coastal_genos_0.1.csv'
## 3 -- 356 columns 357 columns 'genotype_data/Oke_2021_coastal_genos_0.1.csv'
## 4 -- 356 columns 357 columns 'genotype_data/Oke_2021_coastal_genos_0.1.csv'
## 5 -- 356 columns 357 columns 'genotype_data/Oke_2021_coastal_genos_0.1.csv'
## ... ... ........... ........... ..............................................
## See problems(...) for more details.
raw_genos %<>%
mutate(Sample = str_extract(Sample, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")) %>%
dplyr::select(Sample, `Raw Reads`, `On-Target Reads`, `%On-Target`, `%GT`, IFI)
shared_genos <- raw_genos %>%
inner_join(dplyr::select(test_run_genos,Sample, `Raw Reads`, `On-Target Reads`, `%On-Target`, `%GT`, IFI), by = c("Sample" = "Sample"))
ggplot(data = shared_genos)+geom_point(aes(`%GT.x`, `%GT.y`))+geom_abline(aes(slope = 1, intercept = 0))+theme_classic()+xlab("this run %GT")+ylab("previous run %GT")
ggplot(data = shared_genos)+geom_point(aes(`%On-Target.x`, `%On-Target.y`))+geom_abline(aes(slope = 1, intercept = 0))+theme_classic()+xlab("this run %On-Target")+ylab("previous run %On-Target")
Most samples improved when run a second time. Need to check if they were rextracted or purified on this run, or just multiplexed into the library a second time
Some samples have metadata of alternative tissue sampling, let’s compare these to see if any worked better than others
#clean up the metadata
# a lot of the tissue type data is not useful, different names used for the likely the same tissue type, multiple tissue types in one sample, tried to clean up best I could: all muscle grouped together, mulitple tissue flagged, unclear markers as unknown
# tissue is fin unless otherwise noted
#prep the data
tissue_data <- read_tsv("metadata/tissue_type.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## sample = col_character(),
## tissue_type_raw = col_character(),
## tissue_type = col_character()
## )
tissue_data %<>%
mutate(tissue_type = case_when(is.na(tissue_type) ~ "fin",
TRUE ~ tissue_type))
raw_genos %<>%
left_join(tissue_data, by = c("Sample" = "sample")) %>%
mutate(tissue_type = case_when(str_detect(Sample, "OkeCC13") ~ "scale",
TRUE ~ tissue_type))
raw_genos %>%
group_by(tissue_type) %>%
summarise(mean = mean(`%On-Target`), n = n()) %>%
arrange(desc(n))
raw_genos %>%
group_by(tissue_type) %>%
summarise(mean = mean(`On-Target Reads`), n = n()) %>%
arrange(desc(n))
#only plot major tissue sample type(NA, multiple, and n < 5 excluded, also exclude archival samples (scales))
ggplot(data = filter(raw_genos, tissue_type %in% c("fin", "operculum punch", "muscle")))+geom_boxplot(aes(tissue_type, `On-Target Reads`))+theme_bw()
ggplot(data = filter(raw_genos, tissue_type %in% c("fin", "operculum punch", "muscle")))+geom_boxplot(aes(tissue_type, `%On-Target`))+theme_bw()
summary(aov(data = filter(raw_genos, tissue_type %in% c("fin", "operculum punch", "muscle")), `On-Target Reads` ~ tissue_type ))
## Df Sum Sq Mean Sq F value Pr(>F)
## tissue_type 2 2.077e+11 1.038e+11 1.068 0.345
## Residuals 347 3.375e+13 9.725e+10
summary(aov(data = filter(raw_genos, tissue_type %in% c("fin", "operculum punch", "muscle")), `%On-Target` ~ tissue_type ))
## Df Sum Sq Mean Sq F value Pr(>F)
## tissue_type 2 684 342.2 1.643 0.195
## Residuals 347 72263 208.3
No significant differences between major tissue sample types. Fin performed best, but the mean %On Target Reads was only a 13% improvement over operculum punch and 23% improvement over muscle. Given that muscle was reported to be hard to work with during extractions, perhaps fins and operculum punches should be prioritized.
Could we successfuly extract DNA and genotype samples from archival (2013) scales? Let’s compare the genotyping success between major fresh tissue type and archived scales.
raw2 <- filter(raw_genos, tissue_type %in% c("fin", "operculum punch", "muscle", "scale"))
raw2 %<>%
mutate(fresh_scale = case_when(tissue_type %in% c("fin", "operculum punch", "muscle") ~ "fresh",
tissue_type == "scale" ~ "scale"))
kable(raw2 %>%
group_by(fresh_scale) %>%
summarise(mean_on_target = mean(`On-Target Reads`), mean_on_target_perc = mean(`%On-Target`), mean_GT_perc = mean(`%GT`)))
| fresh_scale | mean_on_target | mean_on_target_perc | mean_GT_perc |
|---|---|---|---|
| fresh | 261139.6 | 18.60226 | 85.6968 |
| scale | 142650.7 | 10.64500 | 74.7710 |
summary(aov(data =raw2, `%On-Target` ~ fresh_scale ))
## Df Sum Sq Mean Sq F value Pr(>F)
## fresh_scale 1 616 615.6 2.96 0.0862 .
## Residuals 358 74458 208.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(data =raw2, `On-Target Reads` ~ fresh_scale ))
## Df Sum Sq Mean Sq F value Pr(>F)
## fresh_scale 1 1.365e+11 1.365e+11 1.427 0.233
## Residuals 358 3.425e+13 9.568e+10
kable(raw2 %>%
group_by(tissue_type) %>%
summarise(n = n(), n_retained = sum(`%GT`> 80)/n()))
| tissue_type | n | n_retained |
|---|---|---|
| fin | 234 | 0.8974359 |
| muscle | 45 | 0.7111111 |
| operculum punch | 71 | 0.7605634 |
| scale | 10 | 0.8000000 |
a <- ggplot(data = raw2)+geom_density(aes(x = `On-Target Reads`, color = tissue_type, fill = tissue_type), alpha =0.5) + scale_fill_scico_d()+scale_color_scico_d()+theme_classic()+theme(legend.title = element_blank())
b <- ggplot(data = raw2)+geom_density(aes(x = `%On-Target`/100, color = tissue_type, fill = tissue_type), alpha = 0.5)+scale_fill_scico_d()+scale_color_scico_d()+theme_classic()+theme(legend.title = element_blank())+xlab("Proportion On-Target")
c <- ggplot(data = raw2)+geom_density(aes(x = 1-(`%GT`/100), color = tissue_type, fill = tissue_type), alpha = 0.5)+scale_fill_scico_d()+scale_color_scico_d()+theme_classic()+theme(legend.title = element_blank())+xlab("Proportion Missing Data")
cowplot::plot_grid(a,b,c, ncol = 1)
Only half of the scale samples in the metadata were actually multiplexed into the library (ask Cristin and Sandra why? ws it low DNA quality? if so that affects the interpretation of these results). Answer here, only extracted 10 of the scales, wanted to save the samples if extractions didn’t work.
Scale samples have roughly 1/2 the number and proportion of on target reads, but similar rates of average missing data. There was not a significant differnence in the proportion of on-target reads. There was an interesting pattern of variance in the genotyping success of scale vs fin samples. While both tissue types had similar proportions of samples that completely failed genotypin (>20% missing data, about 20% of samples from each), there was much higher variance among scale samples in missing data rate among samples with <20% missing data.
Together these results suggest that GTseq study using scale samples is feasible, but we should be cautious about depths. Scale samples will require approximately 2x depth to reach a similar number of on-target reads as fresh samples.